Add reader and writer for Puffin, indexes and stats file format#4537
Add reader and writer for Puffin, indexes and stats file format#4537rdblue merged 1 commit intoapache:masterfrom
Conversation
984dce9 to
e823511
Compare
e823511 to
99ce6e5
Compare
|
@rdblue i applied most of the comments (no more I switched writing to use |
|
@findepi, I would prefer to use |
core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java
Outdated
Show resolved
Hide resolved
ce65bcc to
5e88a18
Compare
|
( squashed and rebased on top of current version of #4534, no other changes in this push ) |
5e88a18 to
6086888
Compare
|
AC |
6086888 to
9e8bbc6
Compare
core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java
Outdated
Show resolved
Hide resolved
9e8bbc6 to
bcdd2f1
Compare
|
AC & rebased after #4534 merged. |
bcdd2f1 to
1b15939
Compare
872e253 to
9430d36
Compare
9430d36 to
b6472ec
Compare
|
Rebased after #5019 merged. |
nastra
left a comment
There was a problem hiding this comment.
overall LGTM, just a few suggestions here and there
| return type; | ||
| } | ||
|
|
||
| public List<Integer> inputFields() { |
There was a problem hiding this comment.
the spec mentions that input fields are JSON longs, so I'm just wondering whether this should be a List<Long>?
There was a problem hiding this comment.
Iceberg field IDs are integers, so the implementation is chosen to be limited to integers.
We can maybe change fields | list of JSON long in the puffin spec to be list of integers.
There was a problem hiding this comment.
Yes, we should update the spec to match the type used by the table spec for field IDs.
core/src/test/java/org/apache/iceberg/puffin/TestFileMetadataParser.java
Show resolved
Hide resolved
d3ca971 to
be6bb83
Compare
be6bb83 to
ff02934
Compare
| ByteBuffer footerPayload = PuffinFormat.compress(footerCompression, footerJson); | ||
| outputStream.write(MAGIC); | ||
| int footerPayloadLength = footerPayload.remaining(); | ||
| writeFully(footerPayload); |
There was a problem hiding this comment.
Minor: why not call IOUtil.writeFully(outputStream, footerPayload) directly rather than having a private writeFully method that just adds the output stream?
| if (!buffer.hasRemaining()) { | ||
| return; | ||
| } | ||
| byte[] chunk = new byte[WRITE_CHUNK_SIZE]; |
There was a problem hiding this comment.
Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer? Alternatively, you could pass the temporary buffer in.
There was a problem hiding this comment.
Alternatively, you could pass the temporary buffer in.
This poses sizing challenge. I.e. the caller needs to provide a reasonably sized buffer.
Rather than allocating every time this is called, can you create a ThreadLocal to share this buffer?
Sure, this is feasible. Do you happen to know what would be the expected reuse ratio for such a buffer?
Alternatively we can have a static buffer pool that lends buffers to the current thread.
(assuming we have a problem that we want to fix here)
|
Thanks, @findepi! This looks good to me. We can still follow up, but I think the majority of the changes are ready so I've merged it to avoid keeping a big PR outstanding. |
Format documentation: #4944